arXiv:l506.03958vl [stat.ML] 12Jun2015 


Robust Structured Low-Rank Approximation on the 

Grassmannian 


Clemens Hage and Martin Kleinsteuber 

Department of Electrical Engineering and Information Technology 
Technische Universitat Miinchen 
Arcisstr. 21, 80333 Munich, Germany 
(hage, kleinsteuber)@tum. de, 
http://www.gol.ei.tum.de 


Abstract. Over the past years Robust PCA has been established as a standard 
tool for reliable low-rank approximation of matrices in the presence of outliers. 
Recently, the Robust PCA approach via nuclear norm minimization has been ex¬ 
tended to matrices with linear structures which appear in applications such as 
system identification and data series analysis. At the same time it has been shown 
how to control the rank of a structured approximation via matrix factorization 
approaches. The drawbacks of these methods either lie in the lack of robust¬ 
ness against outliers or in their static nature of repeated batch-processing. We 
present a Robust Structured Low-Rank Approximation method on the Grassman¬ 
nian that on the one hand allows for fast re-initialization in an online setting due 
to subspace identification with manifolds, and that is robust against outliers due 
to a smooth approximation of the (/,-norm cost function on the other hand. The 
method is evaluated in online time series forecasting tasks on simulated and real- 
world data. 


1 Introduction 

Many applications such as system identification and time series analysis motivate the 
problem of Structured Low-Rank Approximation (SLRA). While common low-rank 
approximations like PCA aim to find a low-dimensional subspace to represent high¬ 
dimensional data optimally with respect to some norm or divergence, in the structured 
case this problem is extended by the additional constraint that the low-rank approxima¬ 
tion has to meet a certain linear structure (Hankel, Toeplitz, Sylvester). 

For the prominent case of Hankel matrices a method dubbed Singular Spectrum 
Analysis (SSA) (4) has been presented, which performs the simplest way of SLRA in 
that it computes a low-rank approximation of a (Hankel-) structured matrix followed by 
a so-called diagonal averaging step, which is the projection onto the space of Hankel 
matrices. An obvious drawback of this method is that this projection can destroy the 
low-rank property established before. The method of Cadzow |6| alternates between 
these two steps until the algorithm converges to a solution that is indeed low-rank and 
structured. However, as Chu et al. |8) and Markovsky Ifl2l state, this solution can be 
far away from the initialization with no guarantees of finding an actually meaningful 
approximation to the data. Recently, Ishteva et al. mi have proposed a factorization 


approach with a cost function that joints the structural and low-rank constraint. The 
dimension of the two matrix factors is an upper bound on the rank of the approxima¬ 
tion and its structure is enforced with a side condition. The approximation is fitted to 
the data according to an f^-norm, although it is well known that low-rank approxima¬ 
tions of this kind can be vulnerable against outliers. For the unstructured case this has 
been the major incentive to move from PCA to robustified PCA methods such as the 
Robust PCA method by Candes et al. CD which recovers a subspace in the presence 
of sparse outliers of great magnitude. Ayazoglu et al. JT| have proposed to extend this 
concept to structured matrices by introducing additional Lagrangian multipliers. Their 
method is called Structured Robust PCA (SRPCA), and its performance is demonstrated 
in visual applications like Target Location Prediction, Tracklet Matching (both matrix 
completion problems) and Outlier Removal from trajectories, which can be interpreted 
as outlier identification through robust subspace estimation. 

One of the drawbacks of many Robust PCA approaches and thus also the SRPCA 
approach is their batch-processing nature. In an online setting the algorithm needs to be 
re-run from scratch if the data set grows or changes over time. This can be alleviated 
by factorizing the low-rank approximation of data into an orthogonal matrix represent¬ 
ing the subspace and a coefficient matrix containing the coordinates of the currently 
observed data in this subspace, cf. C191 for the unstructured case. Whenever new data 
comes in, it is possible to initialize the subspace optimization with the previous esti¬ 
mate and to update both the subspace and the coordinates to the new data. Obviously, 
whenever the subspace does not change significantly this saves computational effort 
compared to a random initialization. We will firstly derive a batch algorithm for Ro¬ 
bust Structured Low-Rank Approximation on the Grassmannian and then outline how 
to process structured data online in an efficient way. We illustrate the performance of 
the proposed algorithm on several time series forecasting tasks. 


2 Robust Structured Low-Rank Approximation on the 
Grassmannian using a smoothed f^-norm cost function 

2.1 Low-rank and sparsity constraints in the unstructured case 

Low-rank approximation of data is a well-studied problem, cf. IT3ll for a recent overview. 
In the past years a a trend can be seen towards Robust PCA methods that are tolerant 
against outliers in the data. An often considered data model is X = L + S, where the 
input data X e R mx " is assumed to be composed of a low-rank part L with rank (L) < k 
and a sparse matrix S that contains few non-zero entries. While in 0 the low-rank con¬ 
straint is enforced via minimization of the nuclear norm, in this work we will consider 
the Robust PCA setting on the Grassmannian 

Gr,. m := {P e IT xm |P = UU T , U e St*.,,,}. (1) 

The subspace is hereby represented by an element U of the Stiefel manifold St k, m = {Ue 
B"' x/: [U' IJ = L), where L is the (k x B)-identity matrix. The low-rank approximation is 
then L = UY with Y e R kxn . 


In contrast to the £>-norm in common PCA, robust approaches use relaxations of 
the f’o-norm for fitting the low-rank approximation to the data, as gross outliers might 
otherwise distort the estimation. The smoothed C p -norm 
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presented in |9) behaves similarly to the fo-norm for sparse outliers, but treats small 
additive Gaussian noise like an £2 norm. A Robust Low-Rank Approximation problem 
using the sparsifying function ([2]) writes as 

(U, Y) = arg min hJX - UY), L = UY. (3) 

v ' UeSt„,, t ,YeR t!< " 

The sparse component can be recovered via S = X - L, possibly followed by a thresh¬ 
olding operation to remove residual noise. 

2.2 Extension to linear matrix structures and algorithmic description 

Besides the low-rank and sparsity decomposition no other assumptions are made on the 
data in ([3]). In many applications however, stmctured matrices like Hankel, Sylvester or 
Toeplitz matrices play an important role. Following the notation of EL we denote by 
S a linear matrix structure, by 5(d) a structured matrix obtained from a data series d 
and by Vs (X) the orthogonal projection of any (possibly unstructured) matrix X onto 
the image of S w.r.t. the standard inner product. For example, if H is a Hankel structure 
the orthogonal projection Vu is equivalent to the diagonal averaging step in SSA (4|- 

In order to include the structural constraint we extend the cost function 0 by the 
side condition L e S. This motivates the Lagrangian Multiplier scheme 

min h M (X - UY) + (A, UY - V s (UY)) + f ||UY - V s (UY) f F . (4) 

UESt,„, t .Y£R*x\A£R" ,>< " 

Algorithm [I] outlines the extension of the Robust PCA method from Hage and Kle- 
insteuber a to the case of stmctured matrices. The algorithm considers a partial ob¬ 
servation X of the data with A defining which entries are actually observed. In each 
iteration, three steps are performed. Firstly, the subspace estimate is updated. In this 
realization the subspace is uniquely identified with a Grassmannian projector P = UU T 
with U e StThe optimization problem is solved via Conjugate Gradient (CG) de¬ 
scent with backtracking line-search using a QR-decomposition based retraction on the 
Grassmannian. Once P and thereby U have been found the coordinates Y are updated 
with a CG method in Euclidean space, such that a new optimum low-rank estimate 
L = UY is found. In a third step the Lagrangian multiplier A is updated, then p is in¬ 
creased and p is decreased. While p controls the behavior of the sparsifying function 
hfj(-), the parameter p weighs between the data fitting term and the structural side con¬ 
dition. More precisely, as long as p is small the Robust PCA term is the leading power 
and a low-rank approximation is fitted to the data. With increasing p the structural con¬ 
dition is more and more enforced until it is the dominating term in the cost function. 


In a practical application the optimiza¬ 
tion can also be terminated if the residual 
Hankel penalty e, i.e. the Frobenius dis¬ 
tance to the next Hankel-structured ma¬ 
trix normalized by the number of matrix 
entries falls below a certain threshold r. 


3 Efficient Online Time Series 
Forecasting via Robust SLRA 


Algorithm 1 Alternating minimization 
scheme for Grassmannian Robust Struc¬ 
tured Low-Rank Approximation 

Initialize: 

Choose X 0 e R mx ", s.t. A(X 0 ) = X. 

Initialize U {01 randomly or from k left singular 
vectors of X 0 . 


Y (°) = U <0,T X 0 , 
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We have outlined how to use manifold 
optimization for Robust SLRA on the 
Grassmannian. In the important case of 
a Hankel structure SLRA corresponds 
to identifying an LTI system, cf. Ifl3l . 

In practical applications, however, an 
observed system might be time-variant. 

Or the observed data is not related to 
a physical system at all but still ex¬ 
hibits repetitive or periodic behavior. The 
field of Time Series Analysis HI deals 
with these signals and numerous auto¬ 
regressive methods for filtering and fore¬ 
casting data series exist. In the SLRA 
context a low-rank Hankel matrix (and 
thus an LTI) is fitted to the observed data 
and the future development is extrapo¬ 
lated from this approximation. Thereby, 
the rank bounds the complexity of the ap¬ 
proximation. If the behavior of the data 
changes over time a new model needs 
to be determined for each observation 
instance. For our proposed Grassman¬ 
nian Robust SLRA method this means 
that both a new subspace and new co¬ 
ordinates need to be computed. How¬ 
ever, when the signal characteristics vary 

moderately over time it is likely that the new subspace lies close to the pre¬ 
viously found one. Therefore, the subspace should not be randomly initialized 
but rather updated with the new data point, in a similar way as Robust Sub¬ 
space Tracking (DSI, 03) in the unstructured case. As discussed earlier, how¬ 
ever, we do not optimize directly on the space of structured low-rank matrices. In¬ 
stead we relax the structural constraint, update the subspace and then tighten the 
structural side condition again by varying the parameter p in the cost function. 
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Algorithm 2 Online Time Series Forecast¬ 
ing via Grassmannian Robust SLRA 

Input: data series d e M. N , system order k, 
analysis dimension m, forecasting range r 
for j = 2m : N do 

Define x (/) = [d(j - (2m - 1): j) T | 0j] T 
and A according to forecasting range 

Obtain Xy) = H(x U) ) 

Initialize U <0) = U, 


(0) V (0) 


: U (U Y 


Y (0) = U (0)T X o) , Lm = 

p(0) _ jj(0)|j(0)T 

Select number of iterations At) 
Choose p (0) and p (,) 

( a) ri/dot-b 
Compute c pU) = j 

for i = 1 


: /,,) do 


In Algorithm[2]we describe an Online 
Time Series Forecasting method based 
on Robust Structured Low-Rank Ap¬ 
proximation on the Grassmannian. The 
algorithm receives as inputs the time se¬ 
ries of data d to be analyzed as well as the 
desired order of the system and the fore¬ 
casting range, i.e. the number of samples 
to be predicted. Since a Hankel matrix 
of size m x m contains (2m - 1) samples 
of data, the prediction starts at d(2m). A 
data vector X(p of length (2m - 1) is ex¬ 
tracted from the data up to the present po¬ 
sition, padded with zeros according to the 
forecasting range and structured to form 
a Hankel matrix. The first subspace esti¬ 
mate U t0) is not initialized randomly but 
with U( ; _|), the final subspace estimate 
of the previous set of data samples. Note 
that the subscript (j) counts the position 
of the current set of data samples in the 
data stream while in Algorithm [T] the su¬ 
perscript (i) denoted the iteration count 
of the alternating minimization steps in 
the batch process. Accordingly, for each 
set of data samples at position j the num¬ 
ber of alternating minimization steps At) for the current estimation must be set before¬ 
hand. This number varies between a predefined and I max , and the choice is based 
on the residual Hankel penalty €(j-i) of the previous iteration. This corresponds to the 
observation that significant changes in the system behavior lead to higher values for e 
and require more iterations in the optimization process, whereas less update steps are 
required if the subspace changes slowly or does not change at all. We start with the pre¬ 
vious iterate on the Stiefel manifold and execute the three steps from Algorithm[T]in an 
alternating manner until convergence. Notice that the cost function parameter /r is fixed 
here and only the Lagrangian parameter p is changed in each iteration to steer between 
data fit and structure. The forecasting is realized as a robust matrix completion prob¬ 
lem, i.e. the respective entries in the lower right corner of X are considered unobserved 
entries. Once a structured low-rank estimate has been found, the predicted entries of d 
can easily be read from the last entries of l (/l . 


Subspace Step from Alg. [T| 
Coordinate Step from AlglQl 
Multiplier Update from Algi[T| 
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4 Experimental results 


In a first experiment we evaluate our method on an impulse response prediction task for 
a noisy observation of a simulated SISO Linear Time Varying (LTV) system. The data 






Fig. 1 . Forecasting of 3 samples of a SISO-LTV impulse response with additive noise and ouliers 


is generated via 

x(t + 1) = A(/)x(f) + b T u(f), A<7) = e omtZ , Z T = -Z 
y (0 = c T x(f) + n(0 

with b, c being uniform random vectors and Z being a random skew-symmetric matrix. 
The additive observation noise n(f) contains of two parts, Gaussian noise with cr = 0.01 
and randomly appearing (rate 0.05) salt and pepper noise samples that take on values 
of +0.5. The degree of the system is chosen as k = 5 and we generate the impulse 
response of the system for 300 samples. Our method is compared to the SLRA method 
from CD and both algorithms are implemented in MATLAB on a desktop computer. 
We predict three time steps into the future from an observation of 2m - 1 samples, 
and the parameters are empirically chosen as m = 20, p e [ 10 6 , 10] (both methods), 
p = 0.5,// = 0.005, r = 5x 10 4 . The SLRA method is randomly initialized in each step, 
converges within 30 iterations and requires about 0.4 seconds. The iteration number of 
our method varies between = 16 and I max = 128 iterations with an average of 
22 iterations that add up to 0.7 seconds per forecasting step. The forecasting results in 
Figure [I] indie ate that both methods are able to cope with the Gaussian noise quite well 
and predict the system behavior quite reliably, but the spurious outliers introduce errors 
in the SLRA extrapolation due to the G-error weighting. Our proposed method is much 
more robust at the price of a higher computational effort. However, due to the beneficial 
subspace initialization the computation time is still competitive. 

In a second experiment we compare our method on real-world data with SLRA and 
the forecast routine in MATLAB with a 12-month-seasonal ARIMA(0,1,1) modeQ 
The time series is the well-known Airline Passenger dataset from 0 normalized to the 
range [0 1], The upper bound on the rank of the approximation is chosen as k = 8, 
and we forecast 6 samples from 2m - 1 samples with m = 18, which corresponds to 
projecting the monthly amount of passengers half a year into the future from observing 
the past three years. Figure [2] shows that all three methods succeed in forecasting the 
data, with average absolute deviations of 0.060 for SLRA, 0.036 for ARIMA and 0.044 
for our method. On average, the ARIMA implementation requires 1.3s, SLRA 0.7s and 

1 http://mathworks.com/help/econ/forecast-airline-passenger-counts.html 





























Fig. 2. Six month forecast of monthly airline passenger data from the years 1952-1960 based on 
3 year observation period. 



1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 


Fig. 3. Six month forecast of monthly airline passenger data from the years 1996-2009 based on 
3 year observation period. 


our method (/„„■„ = 8,/ m< = 64, 12 iterations on average) is the fastest with 0.3s. 
The popularity of this well-known but also well-behaving dataset has inspired us to 
perform another experiment on airline passenger data. We have obtained the system- 
wide (domestic and international) number of passenger emplanements in the USA for 
the years 1996 - 2014 from the American Bureau of Transportation Statistics (5). Due 
to the dramatic developments in the year 2001 this data is obviously more challenging. 
Figure [3] shows the dataset and the six month forecasts of the three compared methods 
with the experimental setup as before. The seasonal ARIMA model copes best with the 
challenging conditions (average absolute error of 0.086), but it needs to be noticed that 
the actual seasonality is known a priori while SLRA and our method do not have this 
information. As before, the SLRA method is able to forecast data reliably under good 
conditions but suffers from the gross outliers, resulting in an average absolute error 
of 0.182. Finally, our method shows more robustness against gross outliers (average 
absolute error of 0.102), although in this real-world example the low-rank and sparse 
data model is not exactly met. 

























5 Conclusion 


We have presented a novel method for Robust Structured Low-Rank Approximation on 
the Grassmannian. Using an approximated C p - norm, the method robustly fits an approx¬ 
imation of upper-bounded rank and linear structure to the given data. For the special 
case of a Hankel structure we have furthermore shown how to use the developed con¬ 
cept for Robust Online Time Series Forecasting. We have shown how to benefit from 
the manifold setting in online processing, as we can increase the efficiency by re-using 
the previously identified subspace. Experimental results show that our method performs 
effectively and efficiently in simulated and real-world applications. 
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